A[)7 37567 


AN  ALGORITHM  FOR  CRITICAL  VALUES  OF  THE 
TWO-GROUP  RANK  DISTANCE  CLASSIFICATION  STATISTIC 


HARRY  M.  HUGHES,  Ph.  D. 
RICHARD C. McNEE, M.S.


Approved  for  public  release;  distribution  unlimited. 


UNCLASSIFIED

Security Classification

DOCUMENT CONTROL DATA - R & D

(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)

1. ORIGINATING ACTIVITY (Corporate author)

USAF  School  of  Aerospace  Medicine 

Aerospace  Medical  Division  (AFSC) 

Brooks  Air  Force  Base,  Texas  78235 

2a. REPORT SECURITY CLASSIFICATION

Unclassified

2b. GROUP

3. REPORT TITLE

AN ALGORITHM FOR CRITICAL VALUES OF THE TWO-GROUP RANK DISTANCE CLASSIFICATION
STATISTIC

4. DESCRIPTIVE NOTES (Type of report and inclusive dates)

October 1971

5. AUTHOR(S) (First name, middle initial, last name)

Harry  M.  Hughes 

Richard  C.  McNee 

6. REPORT DATE

December 1971

8a. CONTRACT OR GRANT NO.

b. PROJECT NO. 6319

c. Task No. 631901

d. Work Unit No. 631901012

9a.  ORIGINATOR’S  REPORT  NUMBER(S) 

SAM-TR-71-50 

9b. OTHER REPORT NO(S) (Any other numbers that may be assigned
this report)

10. DISTRIBUTION STATEMENT

Approved for public release; distribution unlimited.

11. SUPPLEMENTARY NOTES

12. SPONSORING MILITARY ACTIVITY

USAF  School  of  Aerospace  Medicine 

Aerospace  Medical  Division  (AFSC) 

Brooks  Air  Force  Base,  Texas  78235 

13. ABSTRACT


The rank distance method is restated for discriminating between two groups
using a single variable. A formula is derived for the probability of
misclassifications in a training set of n observations from each group.
Knowledge of this probability permits inference as to the appropriateness
of the variable for discrimination purposes. The formula is exact only for
values of U below a bound which is proportional to n. A table of numerical
values indicates the formula to be useful up to about n = 25 but not
beyond n = 40.


DD FORM 1473                                                  UNCLASSIFIED

Security Classification




FOREWORD 


This work was done in the Biometrics Division under task No. 631901
during October 1971. The paper was submitted for publication on
9 November 1971.


This report has been reviewed and is approved.


EVAN R. GOLTRA, Colonel, USAF, MC
Commander




ABSTRACT 




The  rank  distance  method  is  restated  for  discriminating  between  two 
groups  using  a  single  variable.  A  formula  is  derived  for  the  probability 
of misclassifications in a training set of n observations from each group.
Knowledge  of  this  probability  permits  inference  as  to  the  appropriateness 
of  the  variable  for  discrimination  purposes.  The  formula  is  exact  only 
for  values  of  U  below  a  bound  which  is  proportional  to  n.  A  table  of 
numeric  values  indicates  the  formula  to  be  useful  up  to  about  n  =  25  but 
not  beyond  n  =  40. 



AN  ALGORITHM  FOR  CRITICAL  VALUES  OF  THE 
TWO-GROUP RANK DISTANCE CLASSIFICATION STATISTIC


I.  INTRODUCTION 

The rank distance method of discriminating between groups has been
described in a previous report¹ for the case of two groups. It has the
advantage of being applicable to a wide variety of measures, including
those whose scale may not be very well determined but whose ordering is
well known. We addressed there the problem of judging how well a
particular variable discriminates by noting the number of
misclassifications of a training set using that single variable alone. The
exact distribution of the number of misclassifications, under the
hypothesis that both samples of size n are from the same population, was
computed and presented for values of n up to 12. Computation beyond that
value would require an uneconomic amount of computer time.

We here restate the rank distance method, derive a formula for the
lower tail of the misclassification distribution under the null
hypothesis, and tabulate some of its numeric values.

II.  THE  RANK  DISTANCE  METHOD 

Consider a sample of n individuals from each of two defined groups.
Let $Y_{Ak}$ be the observation on the kth individual known to be from
group A. Let $Y_{Bk}$ be the observation on the kth individual known to
be from group B. The $Y_{Ak}$ and $Y_{Bk}$ are ranked together in
increasing order from 1 to 2n. We then divide the rank assigned to
$Y_{jk}$ by 2n to obtain $r_{jk}$ for k = 1, 2, ..., n, and j = A, B.
Thus far we have a training set of n fractions (called ridits by Bross)
from each group. We next calculate the mean ridit for each group,
$\bar{r}_j$, and proceed to classify each observation as belonging to the
group to whose mean ridit it is closer. Finally, we tend to select those
measures which produce a number of misclassifications in the training set
whose cumulative probability is small under the null hypothesis of equal
distributions within the groups.
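
The steps of the method can be sketched in Python as follows. This is only an illustrative sketch: the function name `rank_distance_misclassifications` is hypothetical, the observations are assumed to contain no ties, and the handling of equal group means (half of the 2n classifications declared incorrect) follows the convention stated in Section III.

```python
from fractions import Fraction


def rank_distance_misclassifications(group_a, group_b):
    """Count training-set misclassifications U for two equal-size samples.

    Assumes no tied observations (an assumption of this sketch).
    """
    n = len(group_a)
    if n == 0 or n != len(group_b):
        raise ValueError("both groups must have the same nonzero size n")

    # Rank the 2n observations together, then divide each rank by 2n (ridit).
    pooled = sorted([(y, "A") for y in group_a] + [(y, "B") for y in group_b])
    ridits = {"A": [], "B": []}
    for rank, (_, label) in enumerate(pooled, start=1):
        ridits[label].append(Fraction(rank, 2 * n))

    # Mean ridit for each group.
    mean = {g: sum(r) / n for g, r in ridits.items()}

    # Equal means: half of the 2n classifications are declared incorrect.
    if mean["A"] == mean["B"]:
        return n

    # An observation is misclassified when it is closer to the other mean.
    u = 0
    for own, other in (("A", "B"), ("B", "A")):
        for r in ridits[own]:
            if abs(r - mean[other]) < abs(r - mean[own]):
                u += 1
    return u
```

For instance, `rank_distance_misclassifications([1, 2, 6], [3, 4, 5])` counts the group A observation 6 and the group B observation 3 as misclassified. Exact rational arithmetic is used so that comparisons of distances are not affected by rounding.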

III.  DERIVATION  OF  FORMULA 

The  probability  of  obtaining  exactly  2k  misclassifications  in  the 
training  set  under  the  null  hypothesis  can  be  expressed  by  the  formula 

$$\mathrm{Prob}(U = 2k) = 2\left[\frac{n!}{k!\,(n-k)!}\right]^{2}\frac{n!\,n!}{(2n)!}$$

¹Hughes, H. M., and R. C. McNee. Rank distance to choose discriminators
for two groups. SAM-TR-71-40, Oct. 1971.


as long as k is smaller than a bound which we shall derive. To establish
this formula, consider first the case in which $\bar{r}_A < \bar{r}_B$.
Since the two group mean ridits average to the overall mean
$\bar{r} = \frac{1}{2} + \frac{1}{4n}$, an individual ridit in group A
will be misclassified if and only if it exceeds $\bar{r}$. There are n
such possible ridit values. (Note that the case of an individual ridit
equaling the overall mean cannot occur.) Out of the total

$$\frac{(2n)!}{n!\,n!}$$

equiprobable ways that n ridits may be selected from 2n for group A (with
the remaining n obviously falling in group B), there are exactly

$$\frac{n!}{k!\,(n-k)!}$$

ways of selecting k ridits from the n values exceeding $\bar{r}$, and
exactly

$$\frac{n!}{(n-k)!\,k!}$$

ways of selecting the n-k ridits from the n values less than $\bar{r}$
which then result in a correct classification. Thus there are

$$\left[\frac{n!}{k!\,(n-k)!}\right]^{2}$$

equiprobable ways that will result in exactly k misclassified
observations in group A and corresponding k misclassified observations in
group B, provided $\bar{r}_A < \bar{r}_B$. When the two group means are
equal, half of the classifications are declared incorrect, so that
k = n/2. As we shall see in the next section, the bound for k is less
than n/2 so that this case need not be considered. In the remaining case
of $\bar{r}_A > \bar{r}_B$, an exactly symmetric argument shows there are

$$\left[\frac{n!}{k!\,(n-k)!}\right]^{2}$$

equiprobable ways that will result in k ridits below $\bar{r}$ and n-k
ridits above $\bar{r}$ in group A. Putting the two cases together, we
have the probability formula: a two for the two symmetric cases, times
the number of ways of getting exactly 2k misclassifications in each case,
times the reciprocal of the total possible ways.
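
As a minimal sketch, the formula can be evaluated with exact rational arithmetic. The function names `prob_exactly` and `prob_at_most` are hypothetical; the code relies only on the identities n!/(k!(n-k)!) = C(n, k) and n!n!/(2n)! = 1/C(2n, n), and it does not check that k lies below the bound on which the formula's exactness depends.

```python
from fractions import Fraction
from math import comb


def prob_exactly(n, k):
    """Formula value for Prob(U = 2k): 2 * C(n, k)^2 / C(2n, n)."""
    return Fraction(2 * comb(n, k) ** 2, comb(2 * n, n))


def prob_at_most(n, k):
    """Cumulative formula value for Prob(U <= 2k)."""
    return sum(prob_exactly(n, i) for i in range(k + 1))
```

For example, `float(prob_at_most(11, 2))` gives the cumulative value for at most four misclassifications with n = 11.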




IV. RANGE OF ACCURACY OF FORMULA

The foregoing derivation is valid, provided all of the ways counted
actually fall into the case being considered. In particular, for the
case $\bar{r}_A < \bar{r}_B$, we must assure that the k ridits larger
than $\bar{r}$ do not force $\bar{r}_A$ greater than $\bar{r}$. The
greatest value of $\bar{r}_A$ is achieved when the k chosen ridits are
the last k:

$$1 - \frac{k-1}{2n},\quad 1 - \frac{k-2}{2n},\quad \ldots,\quad 1 - \frac{1}{2n},\quad 1$$

and the n-k chosen ridits are the largest ones below $\bar{r}$:

$$\frac{k+1}{2n},\quad \frac{k+2}{2n},\quad \ldots,\quad \frac{1}{2}.$$

Each of these two sets is an arithmetic sequence; the first totals
$k\left(2 - \frac{k-1}{2n}\right)/2$ and the second totals
$\left[\frac{1}{2} + \frac{k+1}{2n}\right](n-k)/2$, so that our
restriction is

$$\bar{r}_A = \left[k\left(2 - \frac{k-1}{2n}\right)/2 + \left(\frac{1}{2} + \frac{k+1}{2n}\right)(n-k)/2\right]/n$$

$$= \frac{k(4n-k+1) + (n+k+1)(n-k)}{4n^{2}}$$

$$= \frac{n^{2} + n + 4nk - 2k^{2}}{4n^{2}}$$

$$< \bar{r} = (2n+1)/4n.$$

Cross multiplying, the condition becomes

$$n^{2} + n + 4nk - 2k^{2} < 2n^{2} + n$$

$$n^{2} < 2n^{2} - 4nk + 2k^{2}$$

$$n^{2} < 2(n-k)^{2}$$

$$n < \sqrt{2}\,(n-k)$$

since k will not exceed n. Thus the condition becomes

$$\sqrt{2}\,k < (\sqrt{2} - 1)\,n$$

$$k < n\,(2 - \sqrt{2})/2 = .29289\,n$$

to insure that our count is exact. The other case,
$\bar{r}_A > \bar{r}_B$, reduces to the same condition when we take the
k lowest ridits, the n-k ridits that just exceed 1/2, and require that
the mean exceed (2n+1)/4n.
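
The bound can be checked by brute force for small n. The sketch below enumerates every equiprobable assignment of n of the 2n ranks to group A, classifies each rank by its nearer group mean, and tallies U. The function names are hypothetical, and the enumeration grows as C(2n, n), so it is practical only for small n. Distances are compared in integer form (rank times n against the group rank sums) to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb, sqrt


def exact_distribution(n):
    """Exact null distribution of U by enumerating all C(2n, n) splits."""
    ranks = range(1, 2 * n + 1)
    total_rank = n * (2 * n + 1)  # sum of ranks 1..2n
    counts = Counter()
    for a in combinations(ranks, n):
        sum_a = sum(a)
        sum_b = total_rank - sum_a
        if sum_a == sum_b:
            counts[n] += 1  # equal means: half declared incorrect
            continue
        in_a = set(a)
        u = 0
        for j in ranks:
            own, other = (sum_a, sum_b) if j in in_a else (sum_b, sum_a)
            # |j - other/n| < |j - own/n|, multiplied through by n
            if abs(n * j - other) < abs(n * j - own):
                u += 1
        counts[u] += 1
    total = comb(2 * n, n)
    return {u: Fraction(c, total) for u, c in sorted(counts.items())}


def formula(n, k):
    """Formula value for Prob(U = 2k)."""
    return Fraction(2 * comb(n, k) ** 2, comb(2 * n, n))


def bound_on_u(n):
    """Bound on U = 2k implied by k < n(2 - sqrt(2))/2."""
    return (2 - sqrt(2)) * n
```

Comparing `exact_distribution(n)[2 * k]` with `formula(n, k)` for each k shows where the two agree; by the argument above they should coincide whenever 2k is below `bound_on_u(n)`.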
V. DISCUSSION


Numeric values of the formula have been computed and accumulated in
table I. For any n from 7 to 40, this table presents the cumulative
probability for the last three values of U before the bound that insures
exact count for that n, and one value beyond. It also lists the bound.
For example, the second entry opposite n = 11 was calculated as

$$2\left[1 + 11^{2} + 55^{2}\right]\frac{11!\,11!}{22!}$$

with at least 10 significant figures, then rounded for entry into the
table.
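
Written out, the arithmetic of this example is as follows. The terms 1, 11 and 55 are the binomial coefficients C(11, 0), C(11, 1) and C(11, 2), and 11!11!/22! is the reciprocal of C(22, 11) = 705432.

```latex
\[
2\left[\binom{11}{0}^{2} + \binom{11}{1}^{2} + \binom{11}{2}^{2}\right]
\Big/ \binom{22}{11}
= \frac{2\,(1 + 121 + 3025)}{705432}
= \frac{6294}{705432}
\approx 0.0089
\]
```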


For a sample of 11 from each of two populations, either zero or 2
misclassifications in the training set would be highly significant
indication that the measure being used is not equally distributed in the
two populations and hence is a likely discriminator. Four
misclassifications in the training set would be significant at the 0.9%
level, which would still appear to indicate a good discriminator, while
6 misclassifications of the 22 observations would be significant only at
a level greater than 8.6%.

The fact that the bound is 6.4 tells us that the probability listed is
exact for 6 misclassifications or less.

The final probability entry on each line is not an exact value, but
is listed because it appears to be accurate enough for the purposes of a
critical value. For n = 9, the value .3469 approximates the exact value
.3455; for n = 10, the approximate is exact to four decimal places. For
n = 11, we have shifted to a different set of U values, but the
comparison is still .3910 true and .3949 approximate. The approximate and
exact probabilities for values of n from 5 through 12 are given in
table II.

It appears that the differences between the approximate and exact
probabilities for values of n above 12 would not be large enough to be of
any real importance in the selection of measures. Hence it would appear
safe to use the fourth probability column of table I for those larger
values of n where we do not know the exact value.

By use of table I we are able to determine 5% significant values up
to about n = 20, 1% significant values up to about n = 30, and some idea
of the smaller significant values up to n = 40. Because of the
inequality restriction, this formula is of no avail for values larger
than the ones just summarized.
TABLE I


[Table body not legible in the source. Per the Discussion: cumulative
probabilities from the formula for n = 7 to 40, with the bound on U.]
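
Since the layout of table I is described in the Discussion (for each n, the cumulative probability at the last three values of U before the bound and at one value beyond, plus the bound itself), its rows can be recomputed from the formula. The following is a sketch under that reading of the layout, not a transcription of the original table; the function names are hypothetical and floating-point output is rounded only for display.

```python
from math import comb, sqrt


def cumulative_prob(n, k):
    """Formula value for Prob(U <= 2k): 2 * sum C(n, i)^2 / C(2n, n)."""
    return 2 * sum(comb(n, i) ** 2 for i in range(k + 1)) / comb(2 * n, n)


def table_row(n):
    """U values, cumulative probabilities, and bound for one value of n."""
    bound = (2 - sqrt(2)) * n  # bound on U = 2k
    k_max = max(k for k in range(n) if 2 * k < bound)
    # Last three U values before the bound, and one value beyond it.
    us = [2 * k for k in range(k_max - 2, k_max + 2)]
    probs = [cumulative_prob(n, u // 2) for u in us]
    return us, probs, bound


if __name__ == "__main__":
    for n in range(7, 41):
        us, probs, bound = table_row(n)
        cells = "  ".join(f"U<={u:2d}: {p:.4f}" for u, p in zip(us, probs))
        print(f"n={n:2d}  {cells}  bound={bound:.1f}")
```

The last probability in each row lies beyond the bound and is therefore the approximate value discussed above, not an exact one.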


TABLE II


Approximate and exact probabilities that the number of
misclassifications in training set ≤ U


 n    U            Approx.      Exact        Difference   Bound on U

 5    [illegible]  [illegible]  [illegible]  .1270        2.9
 6    [illegible]  [illegible]  [illegible]  [illegible]  3.5
 7    6            1.0000       .8596        .1404        4.1
 8    6            .6193        .5911        .0282        4.7
 9    6            .3469        .3455        .0014        5.3
10    6            .1789        .1789        .0000        5.9
11    8            .3949        .3910        .0039        6.4
12    8            .2203        .2201        .0002        7.0

